Iterative Rule Segmentation under Minimum Description Length for Unsupervised Transduction Grammar Induction

Authors

Markus Saers

Karteek Addanki

Dekai Wu

Abstract

We argue that for purely incremental unsupervised learning of phrasal inversion transduction grammars, a minimum description length driven, iterative top-down rule segmentation approach that is the polar opposite of Saers, Addanki, and Wu’s previous 2012 bottom-up iterative rule chunking model yields significantly better translation accuracy and grammar parsimony. We still aim for unsupervised bilingual grammar induction such that training and testing are optimized upon the same exact underlying model—a basic principle of machine learning and statistical prediction that has become unduly ignored in statistical machine translation models of late, where most decoders are badly mismatched to the training assumptions. Our novel approach learns phrasal translations by recursively subsegmenting the training corpus, as opposed to our previous model—where we start with a token-based transduction grammar and iteratively build larger chunks. Moreover, the rule segmentation decisions in our approach are driven by a minimum description length objective, whereas the rule chunking decisions were driven by a maximum likelihood objective. We demonstrate empirically how this trades off maximum likelihood against model size, aiming for a more parsimonious grammar that escapes the perfect overfitting to the training data that we start out with, and gradually generalizes to previously unseen sentence translations so long as the model shrinks enough to warrant a looser fit to the training data. Experimental results show that our approach produces a significantly smaller and better model than the chunking-based approach.
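The trade-off described in the abstract can be illustrated with a toy calculation. The following is only a minimal sketch, not the authors' implementation: it assumes a uniform per-symbol code for the grammar and takes sentence probabilities as given numbers rather than computing them by parsing. All function names, the rule tuples, and the alphabet size of 16 are hypothetical choices made for illustration.

```python
import math

def model_bits(rules, alphabet_size):
    """Bits needed to write down all rules, assuming (hypothetically)
    a uniform code in which every symbol costs log2(alphabet_size) bits."""
    bits_per_symbol = math.log2(alphabet_size)
    return sum(len(rule) for rule in rules) * bits_per_symbol

def data_bits(sentence_probs):
    """Bits needed to encode the corpus given the model:
    the negative log2 probability of each sentence pair, summed."""
    return -sum(math.log2(p) for p in sentence_probs)

def description_length(rules, alphabet_size, sentence_probs):
    """Total description length: model cost plus data-given-model cost."""
    return model_bits(rules, alphabet_size) + data_bits(sentence_probs)

def accept_segmentation(dl_before, dl_after):
    """Accept a candidate segmentation only if the total gets shorter."""
    return dl_after < dl_before

# Toy grammar that memorizes two sentence pairs as two long rules.
# Each tuple is a left-hand side followed by its right-hand-side symbols.
rules_before = [
    ("A", "a", "b", "c", "x", "y", "z"),
    ("A", "a", "b", "d", "x", "y", "w"),
]
# Candidate segmentation: the shared part is factored out into one rule.
rules_after = [
    ("A", "a", "b", "B", "x", "y"),
    ("B", "c", "z"),
    ("B", "d", "w"),
]

ALPHABET_SIZE = 16  # hypothetical symbol inventory, 4 bits per symbol

# Assumed sentence probabilities: the segmented grammar fits the
# training pairs more loosely (0.25 each instead of 0.5 each).
dl_before = description_length(rules_before, ALPHABET_SIZE, [0.5, 0.5])
dl_after = description_length(rules_after, ALPHABET_SIZE, [0.25, 0.25])

print(dl_before, dl_after, accept_segmentation(dl_before, dl_after))
```

In this toy case the segmented grammar spends more bits on the data but fewer on the rules, so the total shrinks and the segmentation would be accepted. A change that loosened the fit without shrinking the grammar enough would be rejected by the same test.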

Similar resources

Combining Top-down and Bottom-up Search for Unsupervised Induction of Transduction Grammars

We show that combining both bottom-up rule chunking and top-down rule segmentation search strategies in purely unsupervised learning of phrasal inversion transduction grammars yields significantly better translation accuracy than either strategy alone. Previous approaches have relied on incrementally building larger rules by chunking smaller rules bottom-up; we introduce a complementary top-down...

Unsupervised Learning of Bilingual Categories in Inversion Transduction Grammar Induction

We present the first known experiments incorporating unsupervised bilingual nonterminal category learning within end-to-end fully unsupervised transduction grammar induction using matched training and testing models. Despite steady recent progress, such induction experiments until now have not allowed for learning differentiated nonterminal categories. We divide the learning into two stages: (1...

Learning Bilingual Categories in Unsupervised Inversion Transduction Grammar Induction

Unsupervised Transduction Grammar Induction via Minimum Description Length

We present a minimalist, unsupervised learning model that induces relatively clean phrasal inversion transduction grammars by employing the minimum description length principle to drive search over a space defined by two opposing extreme types of ITGs. In comparison to most current SMT approaches, the model learns very parsimonious phrase translation lexicons that provide an obvious basis for...

Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation

We present a novel model, Freestyle, that learns to improvise rhyming and fluent responses upon being challenged with a line of hip hop lyrics, by combining both bottom-up token-based rule induction and top-down rule segmentation strategies to learn a stochastic transduction grammar that simultaneously learns both phrasing and rhyming associations. In this attack on the woefully under-explored n...

Publication date: 2013
